This tutorial was originally published on DataCareer.
In this Jupyter Notebook we will retrieve data from open data portal "opendata.swiss". The portal is based on the open source project CKAN. CKAN stands for Comprehensive Knowledge Archive Network. It provides an extensive API for the metadata of the open data catalogue. This means that the information about the datasets can be retrieved from CKAN, but the data itself will have to be downloaded from the servers of the contributors ("opendata.swiss" in this cases).
In this tutorial we will take a look at the population of Swizterland using Python 3. Let's start with importing some packages we will use for this exercise.
import pprint
import requests # 2.18.4
import json # 2.0.9
import pandas as pd # 0.23.0
Like mentioned, the CKAN API functions as a catalog for datasets. We need to define the URL for "opendata.swiss".
# Package list of the Swiss open data portal
packages = 'https://opendata.swiss/api/3/action/package_list'
Let's get a list of all the datasets (called packages in CKAN) listed by "opendata.swiss".
# Make the HTTP request
response = requests.get(packages)
# Use the json module to load CKAN's response into a dictionary
response_dict = json.loads(response.content)
# Check the contents of the response
assert response_dict['success'] is True # make sure if response is OK
The titles of the datasets are in the key called result
. Let's create a new variable called datasets
and find out how many datasets there are available.
datasets = response_dict['result'] # extract all the packages from the response
print(len(datasets)) # print the total number of datasets
6868
This is quite an extensive list. We can print the last 10 to the screen to get an idea of the titles:
datasets[-10:]
['zuzuge-nach-jahr-quartier-geschlecht-altersgruppe-zivilstand-und-familienstellung-nachfuhrung-e', 'zuzuge-nach-monat-stadtquartier-geschlecht-altersgruppe-und-herkunft-seit-20135', 'zuzuge-nach-zuzugsort-stadtquartier-geschlecht-altersgruppe-und-zivilstand-seit-1993', 'zuzuge-nach-zuzugsort-und-stadtquartier-seit-1993', 'zuzuge-pers', 'zvv-fahrplan-tram-und-bus', 'zwangsnutzungen', 'zweigstellen-der-musikschule-konservatorium-zurich-mkz', 'zweiradabstellplatze-in-der-stadt-zurich2', 'zweite-vornamen-neugeborener-madchen-und-knaben-mit-wohnsitz-in-der-stadt-zurich-seit-1993']
For this exercise we will take one from this list, called "bruttoinlandprodukt". Other examples could be: 'bevolkerung', 'elektroautos', or 'bevolkerungsdaten-im-zeitvergleich'.
# Specify the package you are interested in:
package = 'bruttoinlandprodukt'
Now let's download the package/dataset information. We need to take a few steps:
# Base url for package information. This is always the same.
base_url = 'https://opendata.swiss/api/3/action/package_show?id='
# Construct the url for the package of interest
package_information_url = base_url + package
# Make the HTTP request
package_information = requests.get(package_information_url)
# Use the json module to load CKAN's response into a dictionary
package_dict = json.loads(package_information.content)
# Check the contents of the response.
assert package_dict['success'] is True # again make sure if response is OK
package_dict = package_dict['result'] # we only need the 'result' part from the dictionary
# pprint.pprint(package_dict) # pretty print the package information to screen
Did you walk through the information above? You can uncomment the last line (with pretty print) to check out the package information. Is this indeed the dataset you are interested in? If yes, then you need to download the dataset. It is also important to know the format of the dataset, for next steps. This information is also listed in the package information above.
# Get the url for the data from the dictionary
data_url = package_dict['resources'][0]['url']
print('Data url: ' + data_url)
# Print the data format
data_format = package_dict['resources'][0]['format']
print('Data format: ' + data_format)
Data url: https://github.com/StataBS/indikatoren/tree/master/data/4323.tsv Data format: TSV
Notice that this particular dataset is hosted at GitHub. When downloading from GitHub, it is better to request the raw data. We need to rewrite the URL a little bit to get there.
# If data is hosted at GitHub, always download the raw data
if data_url.startswith('https://github.com/'):
data_url = data_url.replace('https://github.com/', 'https://raw.githubusercontent.com/')
data_url = data_url.replace('tree/', '')
print('Data url: ' + data_url)
Data url: https://raw.githubusercontent.com/StataBS/indikatoren/master/data/4323.tsv
Feel free to take a follow the URL's bove. It is good to take a sneak peak so you know what the data will look like. The dataset can come in different formats, so let's specify which ones we are willing to accept and load them into a Pandas DataFrame.
# List of formats we work with in this exercise
csv = ['comma-separated-values', 'CSV', 'csv']
tsv = ['tab-separated-values', 'TSV', 'tsv']
xls = ['XLS']
# Download the data to a Pandas DataFrame. Use seperate function calls, depending on the format of the dataset.
if any(s in data_format for s in csv): # pd.read_csv()
df = pd.read_csv(data_url)
elif any(s in data_format for s in tsv): # pd.read_csv() and specify the delimiter
df = pd.read_csv(data_url, sep='\t')
elif any(s in data_format for s in xls): # pd.read_excel()
df = pd.read_excel(data_url)
else:
print('Sorry, the data format is not supported for this exercise')
# Print the first rows to the screen to inspect the dataset
df.head(5)
Jahr | Bruttoinlandprodukt | DateTime | |
---|---|---|---|
0 | 1980 | 8219.2 | |
1 | 1981 | 8754.3 | |
2 | 1982 | 9459.6 | |
3 | 1983 | 9879.3 | |
4 | 1984 | 10619.4 |
As you can see, we need to make a few adjustments before we can continue. It is best to clean up the dataset before you start doing your analysis.
# Remove the column 'DateTime', because it is empty
df.drop('DateTime', axis=1, inplace=True)
# Make 'Jahr' the index
df.set_index('Jahr', inplace=True)
That's it! Now let's visualise the data with Pandas built-in plot functionality, which is based on 'matplotlib'.
# Use IPython's "magic" in Jupyter Notebook to directly show the plot on the screen.
%matplotlib inline
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x115c8ada0>